Challenge 9

Author

Camille Heathcock

Creating Nice Tables

Interactive preview of the dataset

Code
#interactice preview of dataset
datatable(a_names)
Warning in instance$preRenderHook(instance): It seems your data is too big
for client-side DataTables. You may consider server-side processing: https://
rstudio.github.io/DT/server.html

Part 1 – Formatting Basic Tables

Use the kable() function from the kable package to create nice, basic HTML tables for every table in this lab.

3) Summarizing & Visualizing the Number of Allisons

  1. Make a summary table of the number of babies named “Allison” for each state and the sex of the baby. Specifically, each state should be its own row and each sex should have its own column. Additionally, if there were no babies born with the name “Allison” for that combination of year, state, & sex the table should have a value of 0 (not an NA).
Code
sum_table_a <- a_names |> 
  pivot_wider(names_from = `Sex at Birth`, values_from = Count) |> 
  filter(Name == "Allison") |> 
  group_by(State) |> 
  summarize(across(.cols = c("F", "M"), .fns = sum)) |> 
  mutate(across(.cols = c("F", "M"), ~replace_na(., 0))) 

kable(head(sum_table_a), booktabs = TRUE) %>%
  kableExtra::kable_styling(font_size = 15)
State F M
AK 232 0
AL 1535 0
AR 1198 0
AZ 1880 0
CA 12413 0
CO 1594 0
Code
allisons_female <- a_names |> 
  filter(Name == "Allison", 
         `Sex at Birth` == "F") |> 
  group_by(State) |> 
  summarize("Female Allisons" = sum(Count))

#kable(head(allisons_female))
datatable(allisons_female)
  1. You should have seen in the table above that “Allison” is a name given overwhelmingly to babies assigned female at birth. So, filter the data to include only babies assigned Female at birth.

  2. Make a visualization showing how the popularity of the name “Allison” has changed over the years.

Code
a_names |> 
  filter(Name == "Allison", 
         `Sex at Birth` == "F") |> 
  ggplot(mapping = aes(x = Year, y = Count)) + 
  geom_col(fill = "purple") + 
  labs(x = "", 
       y = "", 
       title = "Count of Females named Allison (1997-2014)")

4) Modeling the Number of Allisons

  1. Create a linear model with the year as the explanatory variable, and the number of Allisons as the response.
Code
allisons_count <- a_names %>%
  filter(Name == "Allison") %>% 
  lm(Count ~ Year, data = .) 

summary(allisons_count)

Call:
lm(formula = Count ~ Year, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-130.91  -90.77  -43.53   40.94  859.60 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) 3881.8855  1765.2407   2.199   0.0281 *
Year          -1.8758     0.8802  -2.131   0.0333 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 138 on 912 degrees of freedom
Multiple R-squared:  0.004955,  Adjusted R-squared:  0.003864 
F-statistic: 4.542 on 1 and 912 DF,  p-value: 0.03335
  1. Visualize the regression model.
Code
allisons_count |> 
  ggplot(mapping = aes(x = Year, y = Count)) + 
  geom_point() + 
  labs(x = "", 
       y = "", 
       title = "Count of Females named Allison (1997-2014)")

  1. Write out the estimated regression equation.

    y = 3881.89 - 1.88*x + e

  2. Plot the residuals of the model, that is, the actual values minus the predicted values. Comment on the residuals - do you see any patterns?

Code
allisons_count |> 
  broom::augment() |> 
  ggplot(mapping = aes(y = .resid, x = .fitted)) +
  geom_point()

The residuals show a pattern in the plot. They get increasingly more dispersed as the model moves away from zero. This indicates that this may not be the most ideal model for the data.

  1. What do you conclude from this model?

We can conclude that this model may not be the best representation of the data, because the residual graphs shows a slight pattern. A similar model with perhaps more explanatory variables might be e better fit of the data.

5) Spelling by State

  1. Narrow the A name dataset (downloaded previously) down to only male-assigned babies named “Allan”, “Alan”, or “Allen”. Make a plot comparing the popularity of these names over time.
Code
allen_data <- a_names |> 
  filter(Name == "Allan" | Name == "Alan" | Name =="Allen")
Code
allen_data |> 
  ggplot(mapping = aes(x = Name, y = Count, fill = Name)) + 
  geom_col() + 
  theme(legend.position = "none") +
  labs(x = "", 
       y = "", 
       title = "Counts of Allen, Alan, and Allan (1997-2014)")

  1. In California, Allan’s spelling of his name is the least common of the three, but perhaps it’s not such an unusual name for his home state of Pennsylvania. Compute the total number of babies born with each spelling of “Allan” in 2000, in Pennsylvania and in California. Specifically, each spelling should be its own column and each state should have its own row. Similar to before, a 0 (not an NA) should be used to represent locations where there were no instances of these names.
Code
allen_names <- allen_data |> 
  pivot_wider(names_from = Name, values_from = Count) |> 
  filter(Year == 2000, `Sex at Birth` == "M", State == "CA" | State == "PA") |> 
  mutate(across(.cols = Allen:Allan, ~replace_na(., 0)))

kable(head(allen_names), booktabs = TRUE) %>%
  kableExtra::kable_styling(font_size = 15)
Year Sex at Birth State Allen Alan Allan
2000 M CA 176 579 131
2000 M PA 56 51 12
  1. Convert your total counts to overall percents. That is, what was the percent breakdown between the three spellings in CA? What about in PA?
Code
allen_p <- allen_data |> 
  pivot_wider(names_from = Name, values_from = Count) |> 
  mutate(across(.cols = Allen:Allan, ~replace_na(., 0))) |> 
  filter(Year == 2000, `Sex at Birth` == "M", State == "CA" | State == "PA") |> 
  mutate(Allen = round(Allen/ sum(across(Allen:Allan)), 2),
         Alan = round(Alan/ sum(across(Allen:Allan)), 2), 
         Allan = round(Allan/ sum(across(Allen:Allan)), 2))

kable(head(allen_p), booktabs = TRUE) %>%
  kableExtra::kable_styling(font_size = 15)
Year Sex at Birth State Allen Alan Allan
2000 M CA 0.18 0.75 0.91
2000 M PA 0.06 0.07 0.08